

Search for: All records

Creators/Authors contains: "Xiong, Yuanjun"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Free, publicly-accessible full text available April 24, 2026
  2. This work introduces a transformer-based image and video tokenizer built on Binary Spherical Quantization (BSQ). The method projects high-dimensional visual embeddings onto a lower-dimensional hypersphere and then applies binary quantization. BSQ offers three key benefits: (1) parameter efficiency, since no explicit codebook is required; (2) scalability to arbitrary token dimensions; and (3) high compression capability, up to 100× compression of visual data with minimal distortion. The tokenizer architecture pairs a transformer encoder-decoder with block-wise causal masking to handle variable-length video inputs. The resulting model, BSQ-ViT, achieves state-of-the-art visual reconstruction on image and video benchmarks while delivering 2.4× the throughput of the previous best method. BSQ-ViT also supports video compression via autoregressive priors for adaptive arithmetic coding, achieving results comparable to leading video compression standards, and it enables masked language models to reach image synthesis quality competitive with GAN- and diffusion-based approaches. (A minimal sketch of the BSQ quantization step appears after this list.)
  3. We propose TubeR: a simple solution for spatio-temporal video action detection. Unlike existing methods that depend on either an offline actor detector or hand-designed actor-positional hypotheses such as proposals or anchors, we directly detect an action tubelet in a video by performing action localization and recognition simultaneously from a single representation. TubeR learns a set of tubelet queries and uses a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively increases model capacity compared to relying on actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context-aware classification head that exploits short-term and long-term context to strengthen action classification, and an action-switch regression head that detects the precise temporal extent of an action. TubeR directly produces action tubelets of variable length and maintains good results even for long video clips. TubeR outperforms the previous state-of-the-art on the commonly used action detection datasets AVA, UCF101-24, and JHMDB51-21. Code will be available in GluonCV (https://cv.gluon.ai/). (An illustrative sketch of tubelet queries appears after this list.)
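To make the BSQ description in item 2 concrete, here is a minimal, hedged sketch (not the authors' implementation) of the projection-and-quantization step: an embedding is linearly projected to a low-dimensional space, normalized onto the unit hypersphere, and each coordinate is binarized to ±1/√L, so the code stays on the sphere while the implicit 2^L-entry codebook is never materialized. All module and variable names here (BinarySphericalQuantizer, proj_down, proj_up, code_dim) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarySphericalQuantizer(nn.Module):
    """Sketch of Binary Spherical Quantization (BSQ); names are illustrative."""

    def __init__(self, embed_dim: int, code_dim: int):
        super().__init__()
        self.proj_down = nn.Linear(embed_dim, code_dim)  # to hypersphere dim L
        self.proj_up = nn.Linear(code_dim, embed_dim)    # back to model dim
        self.code_dim = code_dim

    def forward(self, x: torch.Tensor):
        # Project to the low-dimensional space and map onto the unit sphere.
        u = F.normalize(self.proj_down(x), dim=-1)
        # Binary quantization: each coordinate becomes +/- 1/sqrt(L), so the
        # quantized code also lies on the unit hypersphere. No explicit
        # codebook is stored, which is the parameter-efficiency benefit.
        q = torch.sign(u) / self.code_dim ** 0.5
        # Straight-through estimator: forward pass uses q, backward pass
        # treats the quantizer as the identity so gradients reach proj_down.
        q = u + (q - u).detach()
        # The transmittable token: one bit per dimension.
        bits = (u > 0).to(torch.int64)
        return self.proj_up(q), bits
```

In this reading, each visual token costs only L bits, which is where the high compression ratio quoted in the abstract would come from.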
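Similarly, the tubelet-query mechanism in item 3 can be pictured with a short, simplified sketch: a fixed set of learned queries cross-attends to flattened spatio-temporal video features through a standard transformer decoder, and each query regresses one box per frame (a tubelet) plus an action class. This uses a vanilla PyTorch decoder in place of TubeR's tubelet-attention module, and every name and hyperparameter below is illustrative rather than the paper's.

```python
import torch
import torch.nn as nn

class TubeletDecoderSketch(nn.Module):
    """Illustrative TubeR-style tubelet queries; not the authors' code."""

    def __init__(self, d_model=256, num_queries=15, num_frames=8, num_classes=80):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)  # learned tubelet queries
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # One (cx, cy, w, h) box per frame per query forms a tubelet.
        self.box_head = nn.Linear(d_model, num_frames * 4)
        self.cls_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no action"
        self.num_frames = num_frames

    def forward(self, video_feats: torch.Tensor):
        # video_feats: (B, T*H*W, d_model), flattened spatio-temporal memory
        # from a video backbone/encoder.
        B = video_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        h = self.decoder(q, video_feats)                      # (B, Q, d_model)
        boxes = self.box_head(h).view(B, -1, self.num_frames, 4).sigmoid()
        logits = self.cls_head(h)                             # (B, Q, C+1)
        return boxes, logits
```

The design point the abstract emphasizes is that localization and recognition come from the same per-query representation, with no separate actor detector or anchor hypotheses.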